In [25]:
import graphlab as gl
gl.canvas.set_target("ipynb")

In [26]:
implicit = gl.SFrame('implicit')
explicit = gl.SFrame('explicit')
items = gl.SFrame('items')
ratings = gl.SFrame('ratings')

In [5]:
ratings.show()


Split the data into a training set and a validation set

This allows us to evaluate each model's ability to generalize to held-out data.


In [27]:
train, valid = gl.recommender.util.random_split_by_user(implicit)
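Conceptually, a per-user split samples some users and holds out a fraction of each sampled user's interactions for validation. A minimal plain-Python sketch of the idea (illustrative only, not GraphLab's implementation; the function and parameter names here are made up):

```python
import random

def split_by_user(observations, max_num_users=2, item_test_proportion=0.2, seed=0):
    # Illustrative sketch: sample some users, then hold out a random
    # fraction of each sampled user's items for validation.
    rng = random.Random(seed)
    users = sorted({u for u, _ in observations})
    held_out_users = set(rng.sample(users, min(max_num_users, len(users))))
    train, valid = [], []
    for user, item in observations:
        if user in held_out_users and rng.random() < item_test_proportion:
            valid.append((user, item))
        else:
            train.append((user, item))
    return train, valid

obs = [('u1', 'a'), ('u1', 'b'), ('u2', 'a'), ('u3', 'c'), ('u3', 'd')]
tr, va = split_by_user(obs)
```

Splitting by user (rather than uniformly at random) keeps every validation user present in training, so models can be evaluated on users they have actually seen.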

Feature engineering

Compute the number of times each item has been rated.


In [28]:
num_ratings_per_item = train.groupby('item_id', {'num_users': gl.aggregate.COUNT})
items = items.join(num_ratings_per_item, on='item_id')

Transform the count into a categorical variable using the feature_engineering module.


In [29]:
binner = gl.feature_engineering.FeatureBinner(features=['num_users'], strategy='logarithmic', num_bins=5)
items = binner.fit_transform(items)
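Roughly, a logarithmic binning strategy buckets each count by order of magnitude, so wildly popular items don't dominate the feature. A plain-Python sketch of the idea (the exact bin edges FeatureBinner uses may differ):

```python
import math

def log_bin(count, num_bins=5, base=10):
    # Bucket a positive count by order of magnitude:
    # bin 0 covers [1, 10), bin 1 covers [10, 100), and so on,
    # with anything beyond the last edge clamped into the top bin.
    return min(int(math.log(count, base)), num_bins - 1)
```

For example, an item rated 7 times lands in bin 0, while one rated 250 times lands in bin 2.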

Convert each genre element into a dictionary and each year to an integer.


In [30]:
items['genres'] = items['genres'].apply(lambda x: {k:1 for k in x})
items['year'] = items['year'].astype(int)

In [31]:
items


Out[31]:
+---------+----------------------------------------------+-----------------------------+------+-------------+
| item_id | genres                                       | title                       | year | num_users   |
+---------+----------------------------------------------+-----------------------------+------+-------------+
| 1       | {"Children's": 1, 'Comedy': 1, 'Animati ... | Toy Story                   | 1995 | num_users_4 |
| 2       | {"Children's": 1, 'Adventure': 1, ...        | Jumanji                     | 1995 | num_users_3 |
| 3       | {'Romance': 1, 'Comedy': 1}                  | Grumpier Old Men            | 1995 | num_users_3 |
| 4       | {'Drama': 1, 'Comedy': 1}                    | Waiting to Exhale           | 1995 | num_users_2 |
| 5       | {'Comedy': 1}                                | Father of the Bride Part II | 1995 | num_users_2 |
| 6       | {'Action': 1, 'Thriller': 1, 'Crime': 1}     | Heat                        | 1995 | num_users_3 |
| 7       | {'Romance': 1, 'Comedy': 1}                  | Sabrina                     | 1995 | num_users_3 |
| 8       | {"Children's": 1, 'Adventure': 1}            | Tom and Huck                | 1995 | num_users_2 |
| 9       | {'Action': 1}                                | Sudden Death                | 1995 | num_users_2 |
| 10      | {'Action': 1, 'Adventure': 1, ...            | GoldenEye                   | 1995 | num_users_3 |
+---------+----------------------------------------------+-----------------------------+------+-------------+
[3529 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Train models

A collaborative filtering approach that scores items using the Jaccard similarity between the sets of users who have interacted with each pair of items
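For reference, the Jaccard similarity of two items is the overlap of their user sets divided by the union. A minimal sketch:

```python
def jaccard(users_a, users_b):
    # |A ∩ B| / |A ∪ B| over the sets of users who interacted with each item.
    a, b = set(users_a), set(users_b)
    if not (a or b):
        return 0.0
    return len(a & b) / float(len(a | b))
```

Two items rated by user groups {u1, u2, u3} and {u2, u3, u4} share 2 of 4 distinct users, giving a similarity of 0.5.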


In [32]:
m0 = gl.item_similarity_recommender.create(train)


Recsys training: model = item_similarity
Warning: Column 'score' ignored.
    To use this column as the target, set target = "score" and use a method that allows the use of a target.
Preparing data set.
    Data has 556371 observations with 6038 users and 3529 items.
    Data prepared in: 0.489734s
Computing item similarity statistics:
Computing most similar items for 3529 items:
+-----------------+-----------------+
| Number of items | Elapsed Time    |
+-----------------+-----------------+
| 1000            | 0.80228         |
| 2000            | 0.885286        |
| 3000            | 0.969132        |
+-----------------+-----------------+
Finished training in 1.17977s

Collaborative filtering approach that learns latent factors for each user and each item
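At prediction time, a factorization model scores a user-item pair by combining bias terms with the dot product of the learned latent vectors. A simplified sketch of the scoring rule (the trained model also includes a global intercept and ranking-specific terms):

```python
def score(user_vec, item_vec, user_bias=0.0, item_bias=0.0):
    # Predicted affinity = bias terms + dot product of the latent factors
    # (here, plain lists of floats standing in for learned vectors).
    dot = sum(u * i for u, i in zip(user_vec, item_vec))
    return user_bias + item_bias + dot
```

With `num_factors=32` as reported in the log above, each user and each item is represented by a 32-dimensional vector of this kind.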


In [33]:
m1 = gl.ranking_factorization_recommender.create(train, max_iterations=10)


Recsys training: model = ranking_factorization_recommender
Preparing data set.
    Data has 556371 observations with 6038 users and 3529 items.
    Data prepared in: 0.784596s
Training ranking_factorization_recommender for recommendations.
+--------------------------------+--------------------------------------------------+----------+
| Parameter                      | Description                                      | Value    |
+--------------------------------+--------------------------------------------------+----------+
| num_factors                    | Factor Dimension                                 | 32       |
| regularization                 | L2 Regularization on Factors                     | 1e-09    |
| solver                         | Solver used for training                         | adagrad  |
| linear_regularization          | L2 Regularization on Linear Coefficients         | 1e-09    |
| binary_target                  | Assume Binary Targets                            | True     |
| max_iterations                 | Maximum Number of Iterations                     | 10       |
+--------------------------------+--------------------------------------------------+----------+
  Optimizing model using SGD; tuning step size.
  Using 69546 / 556371 points for tuning the step size.
+---------+-------------------+------------------------------------------+
| Attempt | Initial Step Size | Estimated Objective Value                |
+---------+-------------------+------------------------------------------+
| 0       | 16.6667           | Not Viable                               |
| 1       | 4.16667           | Not Viable                               |
| 2       | 1.04167           | Not Viable                               |
| 3       | 0.260417          | Not Viable                               |
| 4       | 0.0651042         | No Decrease (1.47043 >= 1.38645)         |
| 5       | 0.016276          | 1.34543                                  |
| 6       | 0.00813802        | 1.35577                                  |
| 7       | 0.00406901        | 1.3659                                   |
| 8       | 0.00203451        | 1.37251                                  |
+---------+-------------------+------------------------------------------+
| Final   | 0.016276          | 1.34543                                  |
+---------+-------------------+------------------------------------------+
Starting Optimization.
+---------+--------------+-------------------+-----------------------------------+-------------+
| Iter.   | Elapsed Time | Approx. Objective | Approx. Training Predictive Error | Step Size   |
+---------+--------------+-------------------+-----------------------------------+-------------+
| Initial | 112us        | 1.38645           | 0.693158                          |             |
+---------+--------------+-------------------+-----------------------------------+-------------+
| 1       | 1.21s        | 1.33709           | 0.652715                          | 0.016276    |
| 2       | 2.58s        | 1.30773           | 0.643739                          | 0.016276    |
| 3       | 3.95s        | 1.29445           | 0.641196                          | 0.016276    |
| 4       | 5.29s        | 1.28572           | 0.639083                          | 0.016276    |
| 5       | 6.51s        | 1.2805            | 0.636927                          | 0.016276    |
| 6       | 7.69s        | 1.27567           | 0.635731                          | 0.016276    |
| 7       | 8.95s        | 1.27214           | 0.634294                          | 0.016276    |
| 8       | 10.13s       | 1.26873           | 0.633182                          | 0.016276    |
| 9       | 11.33s       | 1.26672           | 0.632232                          | 0.016276    |
| 10      | 12.94s       | 1.26386           | 0.631565                          | 0.016276    |
+---------+--------------+-------------------+-----------------------------------+-------------+
Optimization Complete: Maximum number of passes through the data reached.
Computing final objective value and training Predictive Error.
       Final objective value: 1.27025
       Final training Predictive Error: 0.62752

Collaborative filtering approach that learns latent factors for users, items, and side data
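With `side_data_factorization` enabled, side features such as a movie's year also get latent vectors, and the score sums the pairwise interactions of all active feature vectors. An illustrative sketch of that structure (the actual model additionally has linear and bias terms):

```python
def score_with_side(user_vec, item_vec, side_vecs):
    # Sum pairwise dot products over all active feature vectors:
    # user x item, user x each side feature, item x each side feature,
    # and side feature x side feature.
    vecs = [user_vec, item_vec] + list(side_vecs)
    total = 0.0
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            total += sum(a * b for a, b in zip(vecs[i], vecs[j]))
    return total
```

This lets items that share side features (e.g. the same release year) borrow statistical strength from one another.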


In [34]:
m2 = gl.ranking_factorization_recommender.create(train, 
                                                 item_data=items[['item_id', 'year']], 
                                                 max_iterations=10)


Recsys training: model = ranking_factorization_recommender
Preparing data set.
    Data has 556371 observations with 6038 users and 3529 items.
    Data prepared in: 0.757925s
Training ranking_factorization_recommender for recommendations.
+--------------------------------+--------------------------------------------------+----------+
| Parameter                      | Description                                      | Value    |
+--------------------------------+--------------------------------------------------+----------+
| num_factors                    | Factor Dimension                                 | 32       |
| regularization                 | L2 Regularization on Factors                     | 1e-09    |
| solver                         | Solver used for training                         | adagrad  |
| linear_regularization          | L2 Regularization on Linear Coefficients         | 1e-09    |
| binary_target                  | Assume Binary Targets                            | True     |
| side_data_factorization        | Assign Factors for Side Data                     | True     |
| max_iterations                 | Maximum Number of Iterations                     | 10       |
+--------------------------------+--------------------------------------------------+----------+
  Optimizing model using SGD; tuning step size.
  Using 69546 / 556371 points for tuning the step size.
+---------+-------------------+------------------------------------------+
| Attempt | Initial Step Size | Estimated Objective Value                |
+---------+-------------------+------------------------------------------+
| 0       | 12.5              | Not Viable                               |
| 1       | 3.125             | Not Viable                               |
| 2       | 0.78125           | Not Viable                               |
| 3       | 0.195312          | Not Viable                               |
| 4       | 0.0488281         | No Decrease (2.00723 >= 1.38643)         |
| 5       | 0.012207          | No Decrease (1.70097 >= 1.38643)         |
| 6       | 0.00305176        | No Decrease (1.4783 >= 1.38643)          |
| 7       | 0.000762939       | No Decrease (1.38799 >= 1.38643)         |
| 8       | 0.000190735       | 1.38582                                  |
| 9       | 9.53674e-05       | 1.38597                                  |
| 10      | 4.76837e-05       | 1.38613                                  |
| 11      | 2.38419e-05       | 1.38622                                  |
+---------+-------------------+------------------------------------------+
| Final   | 0.000190735       | 1.38582                                  |
+---------+-------------------+------------------------------------------+
Starting Optimization.
+---------+--------------+-------------------+-----------------------------------+-------------+
| Iter.   | Elapsed Time | Approx. Objective | Approx. Training Predictive Error | Step Size   |
+---------+--------------+-------------------+-----------------------------------+-------------+
| Initial | 85us         | 1.38643           | 0.693139                          |             |
+---------+--------------+-------------------+-----------------------------------+-------------+
| 1       | 1.54s        | 1.38538           | 0.691463                          | 0.000190735 |
| 2       | 3.09s        | 1.38529           | 0.689766                          | 0.000190735 |
| 3       | 4.64s        | 1.3855            | 0.688442                          | 0.000190735 |
| 4       | 6.17s        | 1.38603           | 0.687318                          | 0.000190735 |
| 5       | 7.68s        | 1.38688           | 0.686364                          | 0.000190735 |
| 6       | 9.21s        | 1.38799           | 0.685558                          | 0.000190735 |
| 7       | 10.74s       | 1.38946           | 0.684931                          | 0.000190735 |
| 8       | 12.60s       | 1.39114           | 0.684416                          | 0.000190735 |
| 9       | 14.37s       | 1.39332           | 0.684127                          | 0.000190735 |
| 10      | 16.60s       | 1.39561           | 0.683958                          | 0.000190735 |
+---------+--------------+-------------------+-----------------------------------+-------------+
Optimization Complete: Maximum number of passes through the data reached.
Computing final objective value and training Predictive Error.
       Final objective value: 1.39739
       Final training Predictive Error: 0.683917

In [35]:
m3 = gl.ranking_factorization_recommender.create(train, 
                                                 item_data=items[['item_id', 'year', 'genres']], 
                                                 max_iterations=10)


Recsys training: model = ranking_factorization_recommender
Preparing data set.
    Data has 556371 observations with 6038 users and 3529 items.
    Data prepared in: 0.619754s
Training ranking_factorization_recommender for recommendations.
+--------------------------------+--------------------------------------------------+----------+
| Parameter                      | Description                                      | Value    |
+--------------------------------+--------------------------------------------------+----------+
| num_factors                    | Factor Dimension                                 | 32       |
| regularization                 | L2 Regularization on Factors                     | 1e-09    |
| solver                         | Solver used for training                         | adagrad  |
| linear_regularization          | L2 Regularization on Linear Coefficients         | 1e-09    |
| binary_target                  | Assume Binary Targets                            | True     |
| side_data_factorization        | Assign Factors for Side Data                     | True     |
| max_iterations                 | Maximum Number of Iterations                     | 10       |
+--------------------------------+--------------------------------------------------+----------+
  Optimizing model using SGD; tuning step size.
  Using 69546 / 556371 points for tuning the step size.
+---------+-------------------+------------------------------------------+
| Attempt | Initial Step Size | Estimated Objective Value                |
+---------+-------------------+------------------------------------------+
| 0       | 10                | Not Viable                               |
| 1       | 2.5               | Not Viable                               |
| 2       | 0.625             | Not Viable                               |
| 3       | 0.15625           | Not Viable                               |
| 4       | 0.0390625         | No Decrease (1.70989 >= 1.38659)         |
| 5       | 0.00976562        | No Decrease (1.86695 >= 1.38659)         |
| 6       | 0.00244141        | No Decrease (1.42815 >= 1.38659)         |
| 7       | 0.000610352       | No Decrease (1.39472 >= 1.38659)         |
| 8       | 0.000152588       | 1.38591                                  |
| 9       | 7.62939e-05       | 1.38605                                  |
| 10      | 3.8147e-05        | 1.38615                                  |
| 11      | 1.90735e-05       | 1.38623                                  |
+---------+-------------------+------------------------------------------+
| Final   | 0.000152588       | 1.38591                                  |
+---------+-------------------+------------------------------------------+
Starting Optimization.
+---------+--------------+-------------------+-----------------------------------+-------------+
| Iter.   | Elapsed Time | Approx. Objective | Approx. Training Predictive Error | Step Size   |
+---------+--------------+-------------------+-----------------------------------+-------------+
| Initial | 109us        | 1.38659           | 0.693033                          |             |
+---------+--------------+-------------------+-----------------------------------+-------------+
| 1       | 2.03s        | 1.38588           | 0.688326                          | 0.000152588 |
| 2       | 4.03s        | 1.38594           | 0.686816                          | 0.000152588 |
| 3       | 6.01s        | 1.38709           | 0.685309                          | 0.000152588 |
| 4       | 7.99s        | 1.38863           | 0.684032                          | 0.000152588 |
| 5       | 9.94s        | 1.39058           | 0.682958                          | 0.000152588 |
| 6       | 11.92s       | 1.39261           | 0.682088                          | 0.000152588 |
| 7       | 13.90s       | 1.3949            | 0.681394                          | 0.000152588 |
| 8       | 16.67s       | 1.39736           | 0.680825                          | 0.000152588 |
| 9       | 19.45s       | 1.40008           | 0.680407                          | 0.000152588 |
| 10      | 22.26s       | 1.40275           | 0.680151                          | 0.000152588 |
+---------+--------------+-------------------+-----------------------------------+-------------+
Optimization Complete: Maximum number of passes through the data reached.
Computing final objective value and training Predictive Error.
       Final objective value: 1.40473
       Final training Predictive Error: 0.680026

Evaluation

Create a precision/recall plot comparing the recommendation quality of the above models on our held-out data.
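As a reminder of what the tables below report: precision@k is the fraction of the top-k recommendations the user actually interacted with in the held-out data, and recall@k is the fraction of the user's held-out items recovered in the top k. A minimal sketch for a single user:

```python
def precision_recall_at_k(recommended, relevant, k):
    # precision@k: fraction of the top-k recommendations that are relevant.
    # recall@k: fraction of all relevant items recovered in the top k.
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / float(k), hits / float(len(relevant))
```

The `mean_precision` and `mean_recall` columns below average these per-user values over the sampled validation users at each cutoff.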


In [40]:
model_comparison = gl.compare(valid, [m0, m1, m2, m3], user_sample=.3)


compare_models: using 297 users to estimate model performance
PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    | 0.340067340067 | 0.0273701812558 |
|   2    | 0.308080808081 | 0.0478083726971 |
|   3    | 0.288439955107 | 0.0644063022978 |
|   4    | 0.273569023569 | 0.0837581789951 |
|   5    | 0.259259259259 |  0.097804796748 |
|   6    | 0.246913580247 |  0.110896121437 |
|   7    | 0.239057239057 |  0.120171306579 |
|   8    | 0.231902356902 |  0.133021390364 |
|   9    | 0.21922933034  |  0.140607202562 |
|   10   | 0.211111111111 |  0.150910548487 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    | 0.208754208754 | 0.0196167645986 |
|   2    | 0.185185185185 | 0.0325496617873 |
|   3    | 0.179573512907 | 0.0423502309465 |
|   4    | 0.172558922559 | 0.0516731283008 |
|   5    | 0.165656565657 | 0.0626777457678 |
|   6    | 0.156565656566 | 0.0708693455856 |
|   7    | 0.151996151996 | 0.0777337348093 |
|   8    | 0.144781144781 | 0.0849421423653 |
|   9    | 0.140665918444 | 0.0912976245018 |
|   10   | 0.135353535354 | 0.0963525147845 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M2

Precision and recall summary statistics by cutoff
+--------+-----------------+------------------+
| cutoff |  mean_precision |   mean_recall    |
+--------+-----------------+------------------+
|   1    |  0.10101010101  | 0.00630900298195 |
|   2    | 0.0942760942761 | 0.0107194435737  |
|   3    |  0.107744107744 | 0.0214553505285  |
|   4    |  0.106902356902 | 0.0282813326662  |
|   5    |  0.104377104377 | 0.0371780264877  |
|   6    |  0.104938271605 | 0.0455064168293  |
|   7    |  0.101491101491 | 0.0499579851595  |
|   8    |  0.101430976431 | 0.0556392248383  |
|   9    |  0.104377104377 | 0.0633802772197  |
|   10   |  0.103703703704 | 0.0683569645247  |
+--------+-----------------+------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M3

Precision and recall summary statistics by cutoff
+--------+-----------------+-----------------+
| cutoff |  mean_precision |   mean_recall   |
+--------+-----------------+-----------------+
|   1    |  0.144781144781 | 0.0120901537519 |
|   2    |  0.116161616162 | 0.0187377280862 |
|   3    | 0.0976430976431 | 0.0222244648549 |
|   4    | 0.0993265993266 | 0.0295933844723 |
|   5    | 0.0976430976431 | 0.0386882538229 |
|   6    | 0.0925925925926 | 0.0426572190333 |
|   7    | 0.0899470899471 | 0.0483395580037 |
|   8    |  0.087962962963 | 0.0517209152979 |
|   9    | 0.0845491956603 | 0.0557697749298 |
|   10   | 0.0814814814815 | 0.0601898874811 |
+--------+-----------------+-----------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M4

Precision and recall summary statistics by cutoff
+--------+-----------------+------------------+
| cutoff |  mean_precision |   mean_recall    |
+--------+-----------------+------------------+
|   1    |  0.013468013468 | 0.00107730896088 |
|   2    | 0.0117845117845 | 0.00128640444201 |
|   3    |  0.013468013468 | 0.00318878244088 |
|   4    |  0.013468013468 | 0.0051776828255  |
|   5    | 0.0127946127946 | 0.0055468443961  |
|   6    |  0.013468013468 | 0.00650871005558 |
|   7    |  0.013468013468 | 0.00751352734827 |
|   8    | 0.0130471380471 | 0.00891898234022 |
|   9    |  0.013468013468 | 0.00981078819657 |
|   10   |  0.013468013468 | 0.0112822641671  |
+--------+-----------------+------------------+
[10 rows x 3 columns]

Model compare metric: precision_recall

In [24]:
gl.show_comparison(model_comparison, [m0, m1, m2, m3])



In [ ]: